Abstract
Air Quality Index (AQI) is a tool to measure how polluted the air currently is or to estimate how polluted it can become. The concentrations of various pollutants in the air are used to calculate the air quality index of an area. Awareness of daily levels of air pollution is important to the citizens of a country, especially to those who suffer from illnesses caused by air pollution. Air quality standards are the foundation of the legal framework for air pollution control. Standards are developed to provide a rationale for protecting public health from the adverse effects of air pollutants, to eliminate or reduce exposure to hazardous air pollutants, and to guide national and local authorities in pollution control decisions. Currently, the government of India uses 23 parameters to measure the quality of air. Of these, the parameters of major public health concern currently include particulate matter, carbon monoxide, ozone, nitrogen dioxide, and sulfur dioxide.
The National Air Quality Index (AQI) was launched in India in September 2014, under the Swachh Bharat Abhiyan. The Central Pollution Control Board (CPCB), along with the State Pollution Control Boards (SPCB), has been operating the National Air Monitoring Program (NAMP), which covers 240 cities of the country and has more than 342 monitoring stations. The AQI for India has 6 categories. They are: Good, Satisfactory, Moderately polluted, Poor, Very poor, and Severe.
Aim
Through this project, I aim to look at the Air Quality Index of the Indian city of Bangalore through parameters such as particulate matter, carbon monoxide, sulphur dioxide, etc. I aim to look at the patterns in measurements of concentrations of these pollutants in the air over the years.
Intent
When I was researching topics related to pollution, I came across the term “Air Quality Index” and read up on it to understand what it means. I realised that air pollution in itself involves numerous variables. I initially started to look for the factors that contribute to air pollution, but I soon changed the direction of my research and started looking at the various components of air pollution.
Through this project, I intend to look at the constituents of air pollution. In order to tackle a problem, one must look beyond its causes and understand the extent of the damage. Likewise, in order to understand air pollution better and work on solutions for it, one must also look at how it has affected the environment rather than only at what is causing it. Another important aspect of this topic is the impact of air pollution on humans: the parameters set for the Air Quality Index are based on their effects on human health.
I chose this topic mainly because I wanted to work with quantitative research and data as the primary focus. Since I already had a basic understanding of the topic from my preliminary research, I decided to work with it. Another important factor that led me to pick this topic is the daily news that I read about the “horrible” living conditions in metropolitan cities, which is about air pollution and its adverse effects on human health. So, I decided to look at what exactly is harmful in the air, i.e., which harmful components are polluting the air and at what concentrations.
Research protocol
In order to look at the AQI for Bangalore, I started by researching how the data on air quality is collected. For this, I visited the sites of the Central Pollution Control Board (CPCB) and the Karnataka State Pollution Control Board (KSPCB). I learnt that the data for calculating AQI is collected at multiple locations within each city. For Bangalore city, there are 5 main AQI data collection points.
The research protocol I have followed for this project:
Research for this project
For this project, I have collected data on the concentrations of 6 pollutants in each of the locations.
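Locations (codes as used in the dataset): btm, bwssb, crs, peenya, sghalli
Pollutants (parameter codes as used in the dataset; particulate matter is recorded as either pm10 or pm2.5): co, no, no2, nox, so2, pm10/pm2.5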
Granularity: daily (values of daily average calculated at the end of every day)
Importing the required csv files
# read the csv file on latitudes and longitudes and store it as lat_lon
lat_lon = read.csv("lat_lon_blr.csv", stringsAsFactors = FALSE)
# read the main csv file with explicit column types and store it as blr_data
# note: missing readings are written as "None" in the file; col_number()
# turns them into NA and readr records each one as a parsing problem
library(readr)
blr_data <- read_csv("blr_final.csv", col_types = cols(concentration = col_number(),
obv_date = col_date(format = "%d-%m-%Y")))
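The import above turns every "None" cell into NA but also logs it as a parsing problem. A minimal sketch of one way to avoid that, assuming the file marks missing readings only with the string "None": declare it through the `na` argument of `read_csv`. The three-row file written here is made up and only mimics the layout of blr_final.csv.

```r
library(readr)

# Hypothetical three-row file mimicking the layout of blr_final.csv
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,location,obv_date,parameter,concentration",
             "1,btm,01-01-2018,co,None",
             "2,btm,02-01-2018,co,0.82",
             "3,btm,03-01-2018,co,None"), tmp)

# Declaring "None" as a missing-value marker lets readr turn it into NA
# directly instead of recording a parsing problem for each such cell
demo <- read_csv(tmp,
                 na = c("", "NA", "None"),
                 col_types = cols(id = col_double(),
                                  location = col_character(),
                                  obv_date = col_date(format = "%d-%m-%Y"),
                                  parameter = col_character(),
                                  concentration = col_number()))

sum(is.na(demo$concentration))  # number of missing readings
nrow(problems(demo))            # number of parsing problems
```

The same `na` vector could be passed when reading the real file; whether "None" is the only missing-value marker in it would need to be checked first.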
Importing the required packages
library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)
library(tidyverse)
library(lubridate)
library(ggmap)
library(wesanderson)
# maps is attached last, so its map() masks purrr::map()
library(maps)
Inspecting the imported data
dim(blr_data)
## [1] 21900 5
str(blr_data)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 21900 obs. of 5 variables:
## $ id : num 1 2 3 4 5 6 7 8 9 10 ...
## $ location : chr "btm" "btm" "btm" "btm" ...
## $ obv_date : Date, format: "2018-01-01" "2018-01-02" ...
## $ parameter : chr "co" "co" "co" "co" ...
## $ concentration: num NA NA NA NA NA NA NA NA 0.82 0.55 ...
## - attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 2710 obs. of 5 variables:
## ..$ row : int 1 2 3 4 5 6 7 8 169 170 ...
## ..$ col : chr "concentration" "concentration" "concentration" "concentration" ...
## ..$ expected: chr "a number" "a number" "a number" "a number" ...
## ..$ actual : chr "None" "None" "None" "None" ...
## ..$ file : chr "'blr_final.csv'" "'blr_final.csv'" "'blr_final.csv'" "'blr_final.csv'" ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. location = col_character(),
## .. obv_date = col_date(format = "%d-%m-%Y"),
## .. parameter = col_character(),
## .. concentration = col_number()
## .. )
table(blr_data$parameter)
##
## co no no2 nox pm10 pm2.5 so2
## 3650 3650 3650 3650 1460 2190 3650
table(blr_data$location)
##
## btm bwssb crs peenya sghalli
## 4380 4380 4380 4380 4380
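# the month column does not exist yet (it is added further below as
# observed_month), so this call returns an empty table
table(blr_data$obv_month)
## < table of extent 0 >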
summary(blr_data)
## id location obv_date parameter
## Min. : 1 Length:21900 Min. :2018-01-01 Length:21900
## 1st Qu.: 5476 Class :character 1st Qu.:2018-07-02 Class :character
## Median :10950 Mode :character Median :2018-12-31 Mode :character
## Mean :10950 Mean :2018-12-31
## 3rd Qu.:16425 3rd Qu.:2019-07-02
## Max. :21900 Max. :2019-12-31
##
## concentration
## Min. : 0.00
## 1st Qu.: 3.23
## Median : 10.51
## Mean : 21.13
## 3rd Qu.: 28.77
## Max. :567.61
## NA's :2710
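The summary reports 2710 missing concentrations but not where they sit. A small sketch of how they could be counted per location and parameter follows; the data frame `toy` is a made-up miniature that only borrows the column names of `blr_data`.

```r
library(dplyr)

# Hypothetical miniature of blr_data with the same column names
toy <- data.frame(
  location = c("btm", "btm", "btm", "peenya", "peenya", "peenya"),
  parameter = c("co", "co", "so2", "co", "co", "so2"),
  concentration = c(NA, 0.82, 5.1, 0.55, 0.60, NA)
)

# Count and share of missing readings for each location/parameter pair
missing_summary <- toy %>%
  group_by(location, parameter) %>%
  summarise(n_obs = n(),
            n_missing = sum(is.na(concentration)),
            share_missing = mean(is.na(concentration))) %>%
  ungroup()

missing_summary
```

Replacing `toy` with `blr_data` should give the same breakdown for the real readings.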
Modifying the Bangalore data table
# adding some columns to the table
# obv_date was already parsed as a Date by read_csv, so it needs no re-parsing
# extracting day, month, year from the date column (lubridate helpers)
years_from_date <- year(blr_data$obv_date)
months_from_date <- month(blr_data$obv_date, label = TRUE, abbr = TRUE)
days_from_date <- day(blr_data$obv_date)
# adding the extracted day, month, year from the date column as individual columns
blr_data$observed_year <- years_from_date
blr_data$observed_month <- months_from_date
blr_data$observed_days <- days_from_date
# converting all values to lower case
blr_data$location = str_to_lower(blr_data$location)
blr_data$parameter = str_to_lower(blr_data$parameter)
blr_data$observed_month = str_to_lower(blr_data$observed_month)
#converting values of months as suitable to plot a graph
months_order <- c("jan", "feb", "mar",
"apr", "may", "jun",
"jul", "aug", "sep",
"oct", "nov", "dec")
blr_data$months_in_order <- factor(blr_data$observed_month, levels = months_order)
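The month labels produced by `month(..., label = TRUE)` follow the locale of the machine, so on a non-English system they may not match the levels "jan" to "dec". A possible locale-independent variant, sketched on made-up dates, builds the labels from the base R constant `month.abb`:

```r
# Hypothetical dates standing in for blr_data$obv_date
dates <- as.Date(c("2018-01-15", "2018-12-03", "2019-07-21"))

# month.abb is a base R constant holding English month abbreviations,
# so the labels do not depend on the locale of the machine
month_levels <- tolower(month.abb)
month_index <- as.integer(format(dates, "%m"))
months_in_order <- factor(month_levels[month_index], levels = month_levels)

months_in_order
levels(months_in_order)
```

Because the factor levels are in calendar order, `facet_wrap(~months_in_order)` lays the panels out from January to December.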
PLOTTING GRAPHS
To begin with, let us look at the location of the AQI data collection locations in Bangalore
latitude_val <- c(12.9,13.05)
longitude_val <- c(77.52,77.7)
bbox <- make_bbox(longitude_val,latitude_val)
b <- get_map(bbox,maptype="toner",source="stamen", color = "bw")
ggmap(b) +
geom_point(data = lat_lon,
aes(lon,lat, color = location),
size = 4) +
scale_color_manual(values = c("darkred",
"blue",
"darkgreen",
"orange",
"magenta")) +
labs(x = "Longitude", y = "Latitude",
title="Location of the AQI data collection points", color = "Locations")
Let us now look at the graphs for individual parameters
# graph 1.1: Sulphur dioxide concentration in Bangalore in year 2018
ggplot((filter(blr_data,
parameter == "so2",
observed_year == "2018")),
aes(x = observed_days, y = log(concentration))) +
geom_point(aes(color = location), size = 0.5) +
facet_wrap(~months_in_order) +
labs(title = "Sulphur dioxide (SO2) concentration in Bangalore in year 2018",
x = "Day of observation",
y = "Log(concentration of SO2 in ug/m3)")
# graph 2.1: Sulphur dioxide concentration in Bangalore in year 2019
ggplot((filter(blr_data,
parameter == "so2",
observed_year == "2019")),
aes(x = observed_days, y = log(concentration))) +
geom_point(aes(color = location), size = 0.5) +
facet_wrap(~months_in_order) +
labs(title = "Sulphur dioxide (SO2) concentration in Bangalore in year 2019",
x = "Day of observation",
y = "Log(concentration of SO2 in ug/m3)")
# graph 1.2: Nitrogen dioxide concentration in Bangalore in year 2018
ggplot((filter(blr_data,
parameter == "no2",
observed_year == "2018")),
aes(x = observed_days, y = log(concentration))) +
geom_point(aes(color = location), size = 0.5) +
facet_wrap(~months_in_order) +
labs(title = "Nitrogen dioxide (NO2) concentration in Bangalore in year 2018",
x = "Day of observation",
y = "Log(concentration of NO2 in ug/m3)")
# graph 2.2: Nitrogen dioxide concentration in Bangalore in year 2019
ggplot((filter(blr_data,
parameter == "no2",
observed_year == "2019")),
aes(x = observed_days, y = log(concentration))) +
geom_point(aes(color = location), size = 0.5) +
facet_wrap(~months_in_order) +
labs(title = "Nitrogen dioxide (NO2) concentration in Bangalore in year 2019",
x = "Day of observation",
y = "Log(concentration of NO2 in ug/m3)")
# graph 1.3: Carbon monoxide concentration in Bangalore in year 2018
ggplot((filter(blr_data,
parameter == "co",
observed_year == "2018")),
aes(x = observed_days, y = log(concentration))) +
geom_point(aes(color = location), size = 0.5) +
facet_wrap(~months_in_order) +
labs(title = "Carbon monoxide (CO) concentration in Bangalore in year 2018",
x = "Day of observation",
y = "Log(concentration of CO in mg/m3)")
# graph 2.3: Carbon monoxide concentration in Bangalore in year 2019
ggplot((filter(blr_data,
parameter == "co",
observed_year == "2019")),
aes(x = observed_days, y = log(concentration))) +
geom_point(aes(color = location), size = 0.5) +
facet_wrap(~months_in_order) +
labs(title = "Carbon monoxide (CO) concentration in Bangalore in year 2019",
x = "Day of observation",
y = "Log(concentration of CO in mg/m3)")
# graph 1.4: Particulate matter concentration in Bangalore in year 2018
# the data mixes PM10 and PM2.5 readings, so the point shape marks which one is plotted
ggplot((filter(blr_data,
parameter %in% c("pm10", "pm2.5"),
observed_year == "2018")),
aes(x = observed_days, y = log(concentration))) +
geom_point(aes(color = location, shape = parameter), size = 0.5) +
facet_wrap(~months_in_order) +
labs(title = "Particulate matter concentration in Bangalore in year 2018",
x = "Day of observation",
y = "Log(concentration of PM in ug/m3)",
shape = "Parameter")
# graph 2.4: Particulate matter concentration in Bangalore in year 2019
# the data mixes PM10 and PM2.5 readings, so the point shape marks which one is plotted
ggplot((filter(blr_data,
parameter %in% c("pm10", "pm2.5"),
observed_year == "2019")),
aes(x = observed_days, y = log(concentration))) +
geom_point(aes(color = location, shape = parameter), size = 0.5) +
facet_wrap(~months_in_order) +
labs(title = "Particulate matter concentration in Bangalore in year 2019",
x = "Day of observation",
y = "Log(concentration of PM in ug/m3)",
shape = "Parameter")
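The eight plots above differ only in the parameter, the year and the labels. One way to avoid the repetition is a small helper function; the sketch below is a suggestion rather than part of the original analysis. The name `plot_daily`, its arguments and the data frame `toy` are all hypothetical, and the helper additionally drops zero readings because `log(0)` is `-Inf`.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: one faceted scatter plot per parameter and year
plot_daily <- function(data, params, yr, pollutant_label, unit = "ug/m3") {
  data %>%
    filter(parameter %in% params,
           observed_year == yr,
           !is.na(concentration),
           concentration > 0) %>%   # log() is -Inf for zero readings
    ggplot(aes(x = observed_days, y = log(concentration))) +
    geom_point(aes(color = location), size = 0.5) +
    facet_wrap(~months_in_order) +
    labs(title = paste(pollutant_label, "concentration in Bangalore in year", yr),
         x = "Day of observation",
         y = paste0("Log(concentration in ", unit, ")"))
}

# Small made-up data set with the columns the helper expects
toy <- data.frame(
  location = c("btm", "peenya", "btm", "peenya"),
  parameter = "so2",
  observed_year = 2018,
  observed_days = c(1, 1, 2, 2),
  months_in_order = factor("jan", levels = tolower(month.abb)),
  concentration = c(4.2, 9.8, 0, 7.5)
)

p <- plot_daily(toy, "so2", 2018, "Sulphur dioxide (SO2)")
```

With the real data, a call such as `plot_daily(blr_data, "co", 2019, "Carbon monoxide (CO)", unit = "mg/m3")` would stand in for graph 2.3.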
If we look at these graphs, we can see abnormal spikes in the values on seemingly random days for all parameters. Let us take a closer look at the month of December.
# graph 3.1: Sulphur dioxide concentration in Bangalore in December 2018, 2019
ggplot((filter(blr_data,
parameter == "so2",
observed_month == "dec")),
aes(x = observed_days, y = log(concentration))) +
geom_point(aes(color = location), size = 1) +
geom_line(aes(color = location)) +
facet_wrap(~observed_year) +
labs(title = "Sulphur dioxide (SO2) concentration in Bangalore in December",
x = "Day of observation",
y = "Log(concentration of SO2 in ug/m3)")
# graph 3.2: Particulate matter concentration in Bangalore in December 2018, 2019
ggplot((filter(blr_data,
parameter %in% c("pm10", "pm2.5"),
observed_month == "dec")),
aes(x = observed_days, y = log(concentration))) +
geom_point(aes(color = location), size = 1) +
geom_line(aes(color = location)) +
facet_wrap(~observed_year) +
labs(title = "Particulate matter (PM) concentration in Bangalore in December",
x = "Day of observation",
y = "Log(concentration of PM in ug/m3)")
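The spikes can also be picked out numerically rather than by eye. The following is a rough sketch on made-up readings: a day is flagged when its value lies more than three median absolute deviations above the median of its own location and parameter series. That cut-off is an assumed rule of thumb, not a threshold taken from the CPCB or from this project.

```r
library(dplyr)

# Hypothetical readings for one station and one parameter
toy <- data.frame(
  location = "peenya",
  parameter = "so2",
  obv_date = as.Date("2018-12-01") + 0:9,
  concentration = c(5, 6, 5.5, 6.2, 48, 5.8, NA, 6.1, 5.2, 5.9)
)

# Flag readings far above the typical level of their own series.
# The cut-off (median + 3 * MAD) is an assumed rule of thumb.
flagged <- toy %>%
  group_by(location, parameter) %>%
  mutate(series_median = median(concentration, na.rm = TRUE),
         series_mad = mad(concentration, na.rm = TRUE),
         is_spike = !is.na(concentration) &
           concentration > series_median + 3 * series_mad) %>%
  ungroup()

# Rows flagged as spikes
filter(flagged, is_spike)
```

The flagged dates could then be compared with a calendar of festivals or with weather records to see whether the spikes line up with anything.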
Let us look at the graphs for the yearly values for each individual parameter
# graph for locations vs the so2 concentrations
# 40 ug/m3 is the upper limit of the "good" band; 41-80 ug/m3 is "satisfactory"
(filter(blr_data,
parameter == "so2")) %>%
mutate(Category = ifelse(concentration <= 40,
"<= 40 - good", "> 40 - satisfactory or worse")) %>%
ggplot(aes(x = obv_date, y= concentration, color = Category)) +
geom_point(size = 0.5) +
# group = 1 keeps one continuous line per panel instead of one line per category
geom_line(aes(group = 1)) +
facet_wrap(~location) +
scale_color_manual(values = c("<= 40 - good" = "lightblue",
"> 40 - satisfactory or worse" = "darkblue")) +
labs(title = "Sulphur dioxide (SO2) concentration in Bangalore Jan, 2018 - Dec, 2019",
x = "Date of observation",
y = "Concentration of SO2 in ug/m3")
# graph for peenya vs the so2 concentrations
# 40 ug/m3 is the upper limit of the "good" band; 41-80 ug/m3 is "satisfactory"
(filter(blr_data,
parameter == "so2",
location == "peenya")) %>%
mutate(Category = ifelse(concentration <= 40,
"<= 40 - good", "> 40 - satisfactory or worse")) %>%
ggplot(aes(x = obv_date, y= concentration, color = Category)) +
geom_point(size = 0.5) +
# group = 1 keeps one continuous line instead of one line per category
geom_line(aes(group = 1)) +
scale_color_manual(values = c("<= 40 - good" = "lightblue",
"> 40 - satisfactory or worse" = "darkblue")) +
labs(title = "Sulphur dioxide (SO2) concentration in Peenya Jan, 2018 - Dec, 2019",
x = "Date of observation",
y = "Concentration of SO2 in ug/m3")
If we look at the above graphs, we can see the SO2 measurements over the two years for all locations in Bangalore. In each of the graphs, the values are mostly within the prescribed concentration of 40 ug/m3 of SO2, with very few exceptions crossing that limit. Now let us try looking at this same data in a different manner.
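To put a number on "mostly within the prescribed concentration", the readings can be binned with `cut()`. The sketch below uses made-up SO2 values and only the two limits that appear in the plotting code (40 and 80 ug/m3); readings above 80 are kept in their own band rather than being counted as satisfactory.

```r
# Hypothetical daily SO2 readings in ug/m3
so2 <- c(3.2, 12.5, 40, 41.7, 80, 95.3, NA)

# Bands built from the 40 and 80 ug/m3 limits; right = TRUE makes each
# interval include its upper limit, so 40 counts as "good"
so2_band <- cut(so2,
                breaks = c(-Inf, 40, 80, Inf),
                labels = c("<= 40 - good", "41-80 - satisfactory", "> 80"),
                right = TRUE)

# Number of days in each band, with missing readings listed separately
table(so2_band, useNA = "ifany")
```

Applied to the SO2 rows of `blr_data`, the same call would show how many days fall into each band at each location.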
pal <- wes_palette("Zissou1", 5, type = "discrete")
# graph for month vs the concentration of so2 in peenya
ggplot((filter(blr_data,
parameter == "so2",
location == "peenya")),
aes(x = observed_days, y = months_in_order,
fill = concentration)) +
geom_tile() +
scale_fill_gradientn(colours = pal) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_discrete(expand = c(0, 0)) +
coord_equal() +
facet_wrap(~observed_year, ncol = 1) +
geom_text(aes(label=concentration), na.rm = T, size = 1, check_overlap = T) +
labs(title = "Sulphur dioxide (SO2) concentration in Peenya",
y = "month",
x = "Days of the month")
# graph for month vs the concentration of pm in peenya
ggplot((filter(blr_data,
parameter %in% c("pm10", "pm2.5"),
location == "peenya")),
aes(x = observed_days, y = months_in_order,
fill = concentration)) +
geom_tile() +
scale_fill_gradientn(colours = pal) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_discrete(expand = c(0, 0)) +
coord_equal() +
facet_wrap(~observed_year, ncol = 1) +
geom_text(aes(label=concentration), na.rm = T, size = 1, check_overlap = T) +
labs(title = "Particulate matter (PM) concentration in Peenya",
y = "month",
x = "Days of the month")
# graph for month vs the concentration of no2 in peenya
ggplot((filter(blr_data,
parameter == "no2",
location == "peenya")),
aes(x = observed_days, y = months_in_order,
fill = concentration)) +
geom_tile() +
scale_fill_gradientn(colours = pal) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_discrete(expand = c(0, 0)) +
coord_equal() +
facet_wrap(~observed_year, ncol = 1) +
geom_text(aes(label=concentration), na.rm = T, size = 1, check_overlap = T) +
labs(title = "Nitrogen dioxide (NO2) concentration in Peenya",
y = "month",
x = "Days of the month")
# graph for month vs the concentration of co in peenya
ggplot((filter(blr_data,
parameter == "co",
location == "peenya")),
aes(x = observed_days, y = months_in_order,
fill = concentration)) +
geom_tile() +
scale_fill_gradientn(colours = pal) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_discrete(expand = c(0, 0)) +
coord_equal() +
facet_wrap(~observed_year, ncol = 1) +
geom_text(aes(label=concentration), na.rm = T, size = 1, check_overlap = T) +
labs(title = "Carbon monoxide (CO) concentration in Peenya",
y = "month",
x = "Days of the month")
Looking at all the above graphs, the first thing that I notice is the abnormalities. After analysing these graphs and the data, I have arrived at some questions. Why are the measurements on some days so high? Is it random? What are the reasons behind such a spike? Do festivals such as Diwali, which contribute a lot to air pollution, cause such spikes? Does the weather play any significant role in the measurements? For example, during winters, we observe a higher pollution rate than we do in summers.
Another important observation that I have made from this analysis is that the location at which the readings are taken also plays a significant role in determining the AQI. For example, Peenya, an industrial area, has a higher AQI than a residential area such as BTM or BWSSB.
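A comparison between locations can be backed with per-location averages. The sketch below uses made-up values and compares mean concentrations per parameter, not the AQI itself, which would need the official sub-index breakpoints.

```r
library(dplyr)

# Hypothetical readings for two locations and two parameters
toy <- data.frame(
  location = rep(c("btm", "peenya"), each = 4),
  parameter = rep(c("so2", "so2", "no2", "no2"), times = 2),
  concentration = c(4, 6, 20, NA, 10, 14, 35, 45)
)

# Mean concentration and number of valid days per parameter and location,
# sorted so that the most polluted location comes first for each parameter
location_means <- toy %>%
  group_by(parameter, location) %>%
  summarise(mean_conc = mean(concentration, na.rm = TRUE),
            n_days = sum(!is.na(concentration))) %>%
  ungroup() %>%
  arrange(parameter, desc(mean_conc))

location_means
```

Keeping `n_days` next to the mean makes it visible when an average rests on only a few valid readings.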
For me, this research project has been a great challenge and equally a great opportunity. It can be used as a base to explore numerous other factors related to pollution in general and air pollution in particular. The observations and inferences from this project led me to several other questions which can be answered by further research and a more extensive understanding of the Air Quality Index.
In the future, with the current dataset, one can explore possible correlations within the data. Another way to move forward is to use data from other cities and compare city-wise data over the years. This way, we can gain a deeper understanding of AQI. Further, datasets on aspects such as pollution levels due to events (festivals) can be collected and used for correlation and comparison. Datasets on the efforts to fight air pollution can also be correlated with the readings: are the efforts really paying off? If data for many years is collected, it can be used to build a model for predicting future AQI. These are some of the pathways that can be explored to expand on this topic.
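As a starting point for the correlation idea, the long table can be reshaped so that each pollutant becomes a column, after which `cor()` gives the pairwise correlations. This is a minimal sketch on made-up values that merely share the column names of `blr_data`; the made-up NO2 series is an exact linear function of the SO2 series, so the correlation in this toy case is 1.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format readings for one location
toy <- data.frame(
  location = "peenya",
  obv_date = rep(as.Date("2019-01-01") + 0:4, times = 2),
  parameter = rep(c("so2", "no2"), each = 5),
  concentration = c(5, 7, 9, 11, 13, 20, 24, 28, 32, 36)
)

# One row per location and day, one column per pollutant
wide <- toy %>%
  pivot_wider(id_cols = c(location, obv_date),
              names_from = parameter,
              values_from = concentration)

# Pairwise correlations; days with a missing value are dropped pair by pair
cor_matrix <- cor(select(wide, -location, -obv_date),
                  use = "pairwise.complete.obs")

cor_matrix
```

A high correlation between two pollutants would only hint at a shared source or shared weather conditions; it would not establish a cause on its own.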